---
title: "EIA-906 & EIA-923 Data"
subtitle: "Annual plant net generation"
author: "Jack Gregory"
date: "21 July 2021"
output:
  bookdown::html_document2:
    toc: yes
    toc_depth: 3
    toc_float: yes
    fig_caption: yes
    highlight: textmate
    keep_md: true
  pdf_document:
    latex_engine: pdflatex
    highlight: haddock
    keep_md: true
---
___

```{r preamble, include=FALSE}

## Initiate 
## ... Packages
pkgs <- c(
  "knitr",                            # Reproducible reporting
  "here",                             # File system
  "httr","rvest",                     # Data scraping
  "geofacet",                         # Spatial data
  "writexl"                           # Write data
)
install.packages(setdiff(pkgs, rownames(installed.packages())))
lapply(pkgs, library, character.only = TRUE)
rm(pkgs)

knitr::opts_chunk$set(
  echo = TRUE,
  message=FALSE, 
  warning=FALSE,
  fig.topcaption=TRUE, 
  fig.show="hold",
  fig.align="center",
  fig.width=7, 
  fig.height=4.6)

source(here("src/preamble.R"))

## ... Files
l.file <- list(
  fuel = here("data/eia923_fuel_dictionary.csv"),
  gen = here("data/eia_netgen.csv"),
  genunit = here("data/eia_netgen_unit.csv")
)

## ... URLs
l.url <- list(
  eia_906 = "https://www.eia.gov/electricity/data/eia923/eia906u.php",
  eia_923 = "https://www.eia.gov/electricity/data/eia923/"
)
```


# Objective

This workbook forms a part of the data cleaning process related to the Grandfathering project.  It describes Energy Information Administration (EIA) Form 906 and Form 923 data.

The EIA-906, EIA-920, EIA-923 and predecessor forms provide monthly and annual data on generation and fuel consumption at the power plant and prime mover level.  A subset of plants -- steam-electric plants 10 MW and above -- also provide boiler level and generator level data.  Power plant data prior to 2001 are separate files for utility and nonutility plants.  For 2001 data and subsequent years, the data are Excel spreadsheet files that include data for all plants and make other changes to the presentation of the data.  The data are sourced from Schedules 3A & 5A: generator data including generation, fuel consumption and stocks.

The data is (theoretically) a balanced panel of plant by fuel by month net generation.  Data for utility plants are available from 1970, and for nonutility plants from 1999.  It includes the  following variables:

- plant identifiers, including Office of the Regulatory Information System PLant (ORISPL) codes;
- operator identifiers;
- plant location;
- fuel type;
- consumed fuel in physical and energy units;
- heat content; and,
- net generation.

The EIA-906 and EIA-923 data along with additional information can be found at the following links:

- [Description](https://www.eia.gov/Survey)
- [EIA-906](https://www.eia.gov/electricity/data/eia923/eia906u.php)
- [EIA-923](https://www.eia.gov/electricity/data/eia923)

Further information of plant identifiers can be found using the EPA-EIA Crosswalk:

- [Description](https://www.epa.gov/airmarkets/power-sector-data-crosswalk)
- [GitHub](https://github.com/USEPA/camd-eia-crosswalk)

The remainder of this workbook is organized as follows.  First, we download and import the necessary data.  Next, we clean the data while also providing a running commentary for replication.  Finally, we export the data to a csv file.


# Download

We first download the set of publicly available xls files and zip archives storing the data here:

- `r here("data/eia/f923")`

Note that the following code only runs if the number of online files exceeds the number previously downloaded.

```{r download-zip-fn, include=FALSE}

## Define download function
download_file <- function(url) {

  ## Assertions
  stopifnot(
    stringr::str_detect(url, "https://www.eia.gov/electricity/data/eia923/"),
    fs::path_ext(url) %in% c("xls","zip")
  )

  cat(fs::path_file(url), "\n", sep="")

  ## Define destination
  file <- here("data/eia/f923", fs::path_file(url))

  ## Download file
  httr::RETRY("GET", url=url, times=5) %>%
    content("raw") %>%
    writeBin(file)
}
```

```{r download}

## Scrape list of EIA-906 files
l.url_906 <- l.url$eia_906 %>%
  httr::RETRY("GET", url=., times=5) %>%
  read_html() %>%
  rvest::html_nodes("a") %>%
  rvest::html_attr("href") %>%
  as_tibble() %>%
  filter(str_detect(value, "\\.xls$")) %>%
  mutate(year = str_extract(value, "(?<=f759)\\d{4}") %>% as.numeric()) %>%
  filter(year>=1997) %>%
  pull(value) %>%
  as.list() %>%
  map(~xml2::url_absolute(.x, l.url$eia_906))

## Scrape list of EIA-923 files
l.url_923 <- l.url$eia_923 %>%
  httr::RETRY("GET", url=., times=5) %>%
  read_html() %>%
  rvest::html_nodes("a") %>%
  rvest::html_attr("href") %>%
  as_tibble() %>%
  filter(str_detect(value, "\\.zip$")) %>%
  mutate(year = str_extract(value, "\\d{4}(?=\\.)") %>% as.numeric()) %>%
  filter(year>=1999 & year<=2019) %>%
  pull(value) %>%
  as.list() %>%
  map(~xml2::url_absolute(.x, l.url$eia_923))

## Combine lists
l.url_dl <- append(l.url_906, l.url_923)

## Perform download, if necessary
nfiles <- fs::dir_ls(here("data/eia/f923"), regexp="(906|906920|923)_")
if (length(l.url_dl)>(length(nfiles) + 6)) {
  walk(l.url_dl, download_file)
}
```


# Generation and Fuel Data

This section is dedicated to importing, cleaning and exporting the data from the worksheet "Page 1 Generation and Fuel Data."  It includes net generation by ORISPL, prime mover, fuel and month.

## Import

Next, we import the data.  This process is made more challenging due to the changes in EIA-906 and EIA-923 reporting format experienced throughout their existence.  The Notes subsection briefly describes these changes, while the Data subsection performs the import and cleaning.

### Notes

Throughout the existence of the Fuel Receipts and Costs survey, there have been a set of consistent variables.  Over time, these have been supplemented with additional descriptors.  The complete set of variables is presented in the table below.

| **Variable** | **Unit** | **Description** |
|--------------|----------|-----------------|
| Plant ID || ORISPL code as a unique one- to five-digit number. |
| Plant Name || Plant name. |
| Operator Name || Operator name. |
| Operator ID || Operator code as a unique one- to five-digit number. |
| State || Plant state code represented by a unique two-letter abbreviation. |
| Census Region || Census region represented by a unique three- to four-letter abbreviation. |
| NERC Region || NERC region represented by a unique four-letter abbreviation. |
| Combined Heat & Power Plant || Indicator of whether the plant is a combined heat and power station in set {Y, N} |
| Reported Prime Mover || Specific prime mover code represented by a unique two-letter abbreviation. |
| Reported Fuel Type Code || Specific fuel type code represented by a unique two- to three-letter abbreviation. |
| AER Fuel Type Code || Specific fuel type code from the AER represented by a unique two- to three-letter abbreviation |
| Physical Unit Label || Physical units. |
| Total Quantity Consumed in Physical Units | Fuel specific | Total quantity consumed for electric generation and useful thermal output in a yearly and wide by month format. |
| Quantity Consumed in Physical Units for Electric Generation | Fuel specific | Quantity consumed for electric generation only in a yearly and wide by month format. |
| Heat Content of Fuels | MMBtu per unit (Fuel specific) | Average heat content of fuel in a yearly and wide by month format. |
| Total Fuel Consumed | MMBtu | Total fuel consumed for electric generation and useful thermal output in a yearly and wide by month format. |
| Quantity Consumed For Electricity | MMBtu | Quantity consumed for electric generation only in a yearly and wide by month format. |
| Electricity Net Generation | MWh | Net electricity generation in a yearly and wide by month format. |

```{r months, include=FALSE}

df.months <- tribble(
  ~MONTH, ~MTH_ABBR, ~MTH, 
  "January", "JAN", 1,
  "February", "FEB", 2,
  "March", "MAR", 3,
  "April", "APR", 4,
  "May", "MAY", 5,
  "June", "JUN", 6,
  "July", "JUL", 7,
  "August", "AUG", 8,
  "September", "SEP", 9,
  "October", "OCT", 10,
  "November", "NOV", 11,
  "December", "DEC", 12
)
```

```{r states, include=FALSE}

df.state <- tribble(
  ~STATE, ~FIPS,
  'AL', 1,
  'AK', 2,
  'AZ', 4,
  'AR', 5,
  'CA', 6,
  'CO', 8,
  'CT', 9,
  'DE', 10,
  'DC', 11,
  'FL', 12,
  'GA', 13,
  'HI', 15,
  'ID', 16,
  'IL', 17,
  'IN', 18,
  'IA', 19,
  'KS', 20,
  'KY', 21,
  'LA', 22,
  'ME', 23,
  'MD', 24,
  'MA', 25,
  'MI', 26,
  'MN', 27,
  'MS', 28,
  'MO', 29,
  'MT', 30,
  'NE', 31,
  'NV', 32,
  'NH', 33,
  'NJ', 34,
  'NM', 35,
  'NY', 36, 
  'NC', 37,
  'ND', 38,
  'OH', 39,
  'OK', 40,
  'OR', 41,
  'PA', 42,
  'RI', 44,
  'SC', 45,
  'SD', 46,
  'TN', 47,
  'TX', 48,
  'UT', 49,
  'VT', 50,
  'VA', 51,
  'WA', 53,
  'WV', 54,
  'WI', 55,
  'WY', 56
)
```

```{r prime-mover, include=FALSE}

df.pm <- tribble(
  ~PM, ~PM_CODE, ~PM_DESC,
  "HY", 1, "Hydro",
  "ST", 2, "Steam",
  "IC", 3, "Internal Combustion",
  "GT", 4, "Gas Turbine",
  "CC", 5, "Combined Cycle (Steam)",
  "CT", 6, "Combined Cycle (Gas Turbine)",
  "WD", 7, "Wood Turbine",
  "SO", 8, "Solar"
)
```

```{r fuel, include=FALSE}

df.fuel <- read_csv(l.file$fuel) %>%
  rename(YR = YEAR)
```

### Data

Next, we import the net generation data from the xls files and zip archives into R.

```{r wrangle-fn-1, include=FALSE}

## Define function
wrangle <- function(df) {
  
  if (!exists("df.months")) {stop("The dataframe df.months does not exist.")}
  if (!exists("df.state")) {stop("The dataframe df.state does not exist.")}
  if (!exists("df.pm")) {stop("The dataframe df.pm does not exist.")}
  if (!exists("df.fuel")) {stop("The dataframe df.fuel does not exist.")}
  
  UseMethod("wrangle")
}

## Define CENSUS method (1997-2000)
wrangle.CENSUS <- function(df) {
  df %>%
    select(ORISPL = PCODE,
           NAME = PLTNAME,
           OPERATOR_ID = UTILCODE,
           OPERATOR = UTILNAME,
           FIPS = FIPST,
           PM = PMDESC,
           FUEL_CODE = FUELDESC,
           YR = YEAR,
           starts_with("GEN")) %>%
    pivot_longer(cols=starts_with("GEN"), names_to="MTH", names_prefix="GEN", 
                 values_to="NETGEN") %>%
    mutate(YR = case_when(str_length(YR)==2 ~ as.numeric(paste0(19, YR)),
                          YR==0 ~ 2000,
                          TRUE ~ YR)) %>%
    mutate_at(vars(ORISPL, OPERATOR_ID, FIPS, YR, MTH, NETGEN), as.numeric) %>%
    mutate(NETGEN = ifelse(is.na(NETGEN), 0, NETGEN)) %>%
    left_join(df.state, by=c("FIPS")) %>%
    left_join(df.fuel %>% select(YR, FUEL, FUEL_CODE), by=c("YR","FUEL_CODE")) %>%
    select(ORISPL, NAME, OPERATOR_ID, OPERATOR, STATE, PM, FUEL, FUEL_CODE, YR, MTH, NETGEN)
}

## Define YEAR method (1999-2000)
wrangle.YEAR <- function(df) {
  df %>%
    select(ORISPL = FACILITYID,
           NAME = FACILNAME,
           STATE = FACILSTATE,
           PM_CODE = PRIMEMOVER,
           FUEL_SYM = FUELTYPE,
           GENTYPE = GENERATYPE,
           YR = YEAR,
           ends_with("GENERAT")) %>%
    pivot_longer(cols=ends_with("GENERAT"), names_to="MTH_ABBR", values_to="NETGEN") %>%
    mutate(MTH_ABBR = str_extract(MTH_ABBR, ".*(?=GENERAT)")) %>%
    mutate_at(vars(ORISPL, PM_CODE, YR, NETGEN), as.numeric) %>%
    mutate(NETGEN = ifelse(is.na(NETGEN), 0, NETGEN/10^3)) %>%
    left_join(df.months %>% select(MTH_ABBR, MTH), by=c("MTH_ABBR")) %>%
    left_join(df.pm %>% select(-PM_DESC), by=c("PM_CODE")) %>%
    mutate(FUEL_SYM = ifelse(PM=="HY", "0", FUEL_SYM)) %>%
    left_join(df.fuel %>% select(YR, FUEL, FUEL_CODE, FUEL_SYM), by=c("YR","FUEL_SYM")) %>%
    filter(GENTYPE=="N") %>%
    select(ORISPL, NAME, STATE, PM, FUEL, FUEL_CODE, YR, MTH, NETGEN)
}

## Define current method (2001-present)
wrangle.current1 <- function(df) {
  df <- df %>%
    select(ORISPL = ...1,
           NAME = ...4,
           OPERATOR_ID = ...6,
           OPERATOR = ...5,
           STATE = ...7,
           PM = ...14,
           FUEL_CODE = ...15,
           YR = ...97,
           NETGEN_1 = ...80,
           NETGEN_2 = ...81,
           NETGEN_3 = ...82,
           NETGEN_4 = ...83,
           NETGEN_5 = ...84,
           NETGEN_6 = ...85,
           NETGEN_7 = ...86,
           NETGEN_8 = ...87,
           NETGEN_9 = ...88,
           NETGEN_10 = ...89,
           NETGEN_11 = ...90,
           NETGEN_12 = ...91) %>%
    filter(!(is.na(NAME) | str_to_upper(NAME)=="PLANT NAME")) %>%
    pivot_longer(cols=starts_with("NETGEN"), names_to="MTH", names_prefix="NETGEN_", 
                 values_to="NETGEN") %>%
    mutate_at(vars(ORISPL, OPERATOR_ID, YR, MTH, NETGEN), as.numeric) %>%
    mutate(NETGEN = ifelse(is.na(NETGEN), 0, NETGEN)) %>%
    left_join(df.fuel %>% select(YR, FUEL, FUEL_CODE), by=c("YR","FUEL_CODE")) %>%
    select(ORISPL, NAME, OPERATOR_ID, OPERATOR, STATE, PM, FUEL, FUEL_CODE, YR, MTH, NETGEN)
}
```

```{r extract-fn-1, include=FALSE}

## Define function
extract_zip <- function(zip, file, ...) {
  
  UseMethod("extract_zip", file)
}

## Define Page 1 method: Generation and Fuel Data
extract_zip.p1 <- function(zip, file, sheet) {
  
  ## Assertions
  stopifnot(
    fs::is_file(zip),
    fs::path_ext(zip)=="zip",
    fs::path_ext(file) %>% str_to_lower() %in% c("xls","xlsx"),
    is.logical(sheet)
  )
  
  cat(fs::path_file(zip), "\n", sep="")
  
  ## IMPORT DATA
  ## Unzip data
  unzip(zip, files=file, exdir=here("data/eia/f923"))
  
  ## Import data
  if (!sheet) {
    df <- readxl::read_excel(here("data/eia/f923", file), guess_max=10^5)
  } else {
    sheet <- readxl::excel_sheets(here("data/eia/f923", file)) %>%
      purrr::keep(str_detect(., "Page 1 Gen"))
    df <- readxl::read_excel(here("data/eia/f923", file), sheet=sheet, 
                             col_names=FALSE, guess_max=10^5)
  }
  
  ## Delete unzipped data
  file %>%
    fs::path_split() %>%
    purrr::flatten() %>%
    .[[1]] %>%
    fs::here("data/eia/f923", .) %>%
    fs::file_delete()

  ## CLEAN DATA
  ## Set df type
  type <- names(df)[[1]] %>%
    stringr::str_to_upper() %>%
    stringr::str_replace_all(c("\\."="", "_"="")) %>%
    stringr::str_extract("^[A-Z]*")
  if (type=="") type <- "current1"
  df <- structure(df, class=c(type, class(df)))
  
  ## Wrangle data based on type; produce error if cleaning fn does not exist
  df <- tryCatch(
    wrangle(df),
    error = function(e) {
      stop(glue("Cleaning function does not currently run or exist for the last imported year.
                 wrangle.{type} {e}"))
    }
  )
  return(df)
}
```

``` {r import-1}

## Build zip list
l.zip1 <- fs::dir_ls(here("data/eia/f923"), regexp="f(906|906920|923)_\\d{4}\\.zip$")

## Build zip & file dataframe
df.zip1 <- l.zip1 %>%
  map_dfr(~utils::unzip(.x, list=TRUE) %>% mutate(Zip = .x)) %>%
  rename_with(.fn=str_to_upper) %>%
  mutate(S3 = case_when(str_detect(NAME, "[fF](759|906).*\\d{4}(mu|nu|)\\.xls") ~ 1,
                        str_detect(NAME, "eia923December2008\\.xls") ~ 1,
                        str_detect(NAME, "EIA923( SCHEDULES |_Schedules_)2_3_4_5.*\\.(xls|XLS|xlsx)") ~ 1,
                        TRUE ~ 0),
         YR = str_extract(NAME, "(19|20)\\d{2}") %>% as.numeric(),
         SHEET = ifelse(YR>=2001, TRUE, FALSE)) %>%
  filter(S3==1) #%>%
  #filter(SHEET)

## Modify NAME class within zip dataframe
df.zip1 <- df.zip1 %>%
  mutate(NAME = map(NAME, ~structure(.x, class=c(class(.x), "p1"))))

## Import dataset
if ((length(l.zip1)+2)!=nrow(df.zip1)) {
  stop("Number of data files does not match the number of zip archives.")
}
df.gen <- pmap_dfr(df.zip1 %>% select(ZIP, NAME, SHEET),
                     ~extract_zip(..1, ..2, ..3))
```


## Clean

At present, we perform the following to clean the dataset:

- Correct `FUEL` and `FUEL_CODE` errors;
- Drop NJ plants with equivalents in PA;
- Drop unnecessary observations, including non-contiguous states and irrelevant ORISPL codes;
- Drop all `FUEL`s other than biomass, coal, gas and oil;
- Drop irrelevant prime movers including CE and FC;
- Drop all data prior to 2003, as a result of missing prime mover data from the 2001 & 2002 data years; and,
- Aggregate over `ORISPL`, `YR` and `MTH`.

```{r clean-1}

## Check for incorrect FUEL_CODEs
## NB - Run prior to "Correct FUEL & FUEL_CODE errors" line below
# df.test <- df.gen_cln %>% filter(is.na(FUEL)) %>%
#     filter(!(ORISPL %in% c(9999,99999) | is.na(ORISPL)))
# df.test_distinct <- df.test %>% distinct(FUEL_CODE, YR) %>% arrange(FUEL_CODE, YR)

## Clean fuel data
df.gen_cln <- df.gen %>%

  ## Correct FUEL & FUEL_CODE errors
  mutate(FUEL = case_when(FUEL_CODE=="OW" ~ "Oil",
                          FUEL_CODE=="WD" ~ "Biomass",
                          FUEL_CODE=="WT" ~ "Other",
                          TRUE ~ FUEL),
         FUEL_CODE = case_when(FUEL_CODE=="OW" ~ "WO",
                               FUEL_CODE=="WD" ~ "WOD",
                               FUEL_CODE=="WT" ~ "WH",
                               TRUE ~ FUEL_CODE),
         FUEL = case_when(ORISPL==55824 & is.na(FUEL_CODE) ~ "Oil", 
                          ORISPL==55860 & is.na(FUEL_CODE) ~ "Water",
                          ORISPL %in% c(56002,56003) & is.na(FUEL_CODE) ~ "Wind",
                          PM=="SO" & FUEL_CODE=="N/A" ~ "Solar",
                          ORISPL==2039 & FUEL_CODE=="N/A" ~ "Oil",
                          ORISPL %in% c(2277,4057) & FUEL_CODE=="N/A" ~ "Other",
                          ORISPL==3069 & FUEL_CODE=="N/A" ~ "Biomass",
                          ORISPL==10282 & is.na(FUEL_CODE) ~ "Hydro",
                          ORISPL==10490 & is.na(FUEL_CODE) ~ "Hydro",
                          ORISPL==50242 & is.na(FUEL_CODE) ~ "Hydro",
                          ORISPL==50952 & is.na(FUEL_CODE) ~ "Gas",
                          ORISPL==54691 & is.na(FUEL_CODE) ~ "Hydro",
                          ORISPL==54825 & is.na(FUEL_CODE) ~ "Oil",
                          TRUE ~FUEL),
         FUEL_CODE = case_when(ORISPL==55824 & is.na(FUEL_CODE) ~ "DFO", 
                               ORISPL==55860 & is.na(FUEL_CODE) ~ "WAT",
                               ORISPL %in% c(56002,56003) & is.na(FUEL_CODE) ~ "WND",
                               PM=="SO" & FUEL_CODE=="N/A" ~ "SUN",
                               ORISPL==2039 & FUEL_CODE=="N/A" ~ "RFO",
                               ORISPL %in% c(2277,4057) & FUEL_CODE=="N/A" ~ "TIR",
                               ORISPL==3069 & FUEL_CODE=="N/A" ~ "WOD",
                               ORISPL==10282 & is.na(FUEL_CODE) ~ "WAT",
                               ORISPL==10490 & is.na(FUEL_CODE) ~ "WAT",
                               ORISPL==50242 & is.na(FUEL_CODE) ~ "WAT",
                               ORISPL==50952 & is.na(FUEL_CODE) ~ "NG",
                               ORISPL==54691 & is.na(FUEL_CODE) ~ "WAT",
                               ORISPL==54825 & is.na(FUEL_CODE) ~ "DFO",
                               TRUE ~ FUEL_CODE)) %>%
  
  ## Correct state errors
  filter(!(STATE=="NJ" & ORISPL %in% c(3113,3115,3118,3130,3131,3132,3136))) %>%
  
  ## Drop unnecessary obs
  filter(!is.na(FUEL_CODE)) %>%
  filter(!(STATE %in% c("AK","HI"))) %>%
  filter(!(ORISPL %in% c(9999,99999) | is.na(ORISPL))) %>%

  ## Keep only relevant fuels and prime movers
  filter(FUEL %in% c("Biomass","Coal","Gas","Oil")) %>%
  filter(!(PM %in% c("CE","FC"))) %>%
  
  ## Drop all values prior to 2003
  filter(YR>=2003) %>%
  
  ## Aggregate coal NETGEN
  group_by(ORISPL, STATE, PM, FUEL, YR, MTH) %>%
  summarise(NETGEN = sum(NETGEN, na.rm=TRUE), .groups="drop")
```


## Analysis

```{r theme, include=FALSE}

theme_eia <- function() {
  theme_classic() +
  theme(strip.background = element_rect(fill="grey85", color=NA),
        axis.line.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.major.y = element_line(colour="grey85", size=0.3),
        panel.grid.minor.y = element_line(colour="grey85", size=0.3),
        axis.line.x = element_line(size=0.4),
        axis.ticks.x = element_line(size=0.4),
        axis.text.x = element_text(angle=50, hjust=1),
        legend.position = "right",
        plot.caption = element_text(hjust=0))
}
```

```{r analysis-gen-state, fig.height=10, fig.width=14}

# df.gen_st <- df.gen_cln %>%
#       group_by(FUEL, STATE, YR, MTH) %>%
#       summarise(NETGEN = sum(NETGEN, na.rm=TRUE)/10^6, .groups="drop") %>%
#       mutate(YM = ymd(paste(YR, MTH, "01", sep="-")))

map(list("Coal","Gas","Oil"),
    ~df.gen_cln %>%
      group_by(FUEL, STATE, YR, MTH) %>%
      summarise(NETGEN = sum(NETGEN, na.rm=TRUE)/10^6, .groups="drop") %>%
      mutate(YM = ymd(paste(YR, MTH, "01", sep="-"))) %>%
      filter(FUEL==.x) %>%
      ggplot() +
        geom_line(aes(x=YM, y=NETGEN), size=0.5) +
        # scale_color_viridis_d(option="inferno", begin=0.1, end=0.55) +
        facet_geo(~ STATE, grid="us_state_contiguous_grid1") +
        labs(title=paste0(.x, " net generation by state"),
             y="Net generation [TWh]",
             x="Month",
             color="") +
        theme_eia() +
        guides(color = guide_legend(override.aes=list(size=2)))
)
```

```{r analysis-gen-state-relative, fig.height=10, fig.width=14}

map(list("Coal","Gas","Oil"),
    ~df.gen_cln %>%
      group_by(FUEL, STATE, YR, MTH) %>%
      summarise(NETGEN = sum(NETGEN, na.rm=TRUE)/10^6, .groups="drop") %>%
      mutate(YM = ymd(paste(YR, MTH, "01", sep="-"))) %>%
      filter(FUEL==.x) %>%
      ggplot() +
        geom_line(aes(x=YM, y=NETGEN), size=0.5) +
        # scale_color_viridis_d(option="inferno", begin=0.1, end=0.55) +
        facet_geo(~ STATE, grid="us_state_contiguous_grid1", scales="free_y") +
        labs(title=paste0(.x, " net generation by state"),
             y="Net generation [TWh]",
             x="Month",
             color="") +
        theme_eia() +
        theme(axis.text.y = element_blank()) +
        guides(color = guide_legend(override.aes=list(size=2)))
)
```


## Export

### Data
This section exports the cleaned net generation by ORISPL data to a csv and/or an rds file:

```{r export-1}

# write_csv(df.gen_cln, l.file$gen)
saveRDS(df.gen_cln, fs::path_ext_set(l.file$gen, "rds"))
```

### Dictionary

The following table describes the columns in the exported dataset.

| **Variable** | **Unit** | **Description** |
|------------------------|-----------------|
| ORISPL || ORISPL code as a unique two- to five-digit integer. |
| STATE || Plant state code represented by a unique two-letter abbreviation. |
| PM || Prime mover in set {CA, CC, CS, CT, GT, IC, OT, ST} |
| FUEL || Fuel type in set {Biomass, Coal, Gas, Oil} |
| YR || Data year from 1997 to 2019. |
| MTH || Data month as a one- or two-digit integer. |
| NETGEN | MWh | Net electricity generation in a particular year-month over all `FUEL_CODE`s in the respective `FUEL` type. |


# Generator Data

This section is dedicated to importing, cleaning and exporting the data from the worksheet "Page 4 Generator Data."  It includes net generation by ORISPL, unit and month.

## Import

Next, we import the data.  The Notes subsection briefly describes the raw data, while the Data subsection performs the import and cleaning.

### Notes

The "Page 4" worksheet only commenced in 2008 and maintains a consistent format throughout.  The complete set of variables is presented in the table below.

| **Variable** | **Unit** | **Description** |
|--------------|----------|-----------------|
| Plant ID || ORISPL code as a unique one- to five-digit number. |
| Plant Name || Plant name. |
| Operator Name || Operator name. |
| Operator ID || Operator code as a unique one- to five-digit number. |
| State || Plant state code represented by a unique two-letter abbreviation. |
| Census Region || Census region represented by a unique three- to four-letter abbreviation. |
| NERC Region || NERC region represented by a unique four-letter abbreviation. |
| Combined Heat & Power Plant || Indicator of whether the plant is a combined heat and power station in set {Y, N} |
| Reported Prime Mover || Specific prime mover code represented by a unique two-letter abbreviation. |
| Generator ID || Generator code as a unique one- to five-alphanumeric. |
| Net Generation | MWh | Net electricity generation in a yearly and wide by month format. |


### Data

Next, we import the net generation data from the xls files and zip archives into R.

```{r wrangle-fn-4, include=FALSE}

## Define current method (2008-present)
wrangle.current4 <- function(df) {
  df <- df %>%
    select(ORISPL = ...1,
           NAME = ...3,
           OPERATOR_ID = ...5,
           OPERATOR = ...4,
           STATE = ...6,
           GENERATOR_ID = ...12,
           PM = ...13,
           YR = ...27,
           NETGEN_1 = ...14,
           NETGEN_2 = ...15,
           NETGEN_3 = ...16,
           NETGEN_4 = ...17,
           NETGEN_5 = ...18,
           NETGEN_6 = ...19,
           NETGEN_7 = ...20,
           NETGEN_8 = ...21,
           NETGEN_9 = ...22,
           NETGEN_10 = ...23,
           NETGEN_11 = ...24,
           NETGEN_12 = ...25) %>%
    filter(!(is.na(NAME) | str_to_upper(NAME)=="PLANT NAME")) %>%
    pivot_longer(cols=starts_with("NETGEN"), names_to="MTH", names_prefix="NETGEN_", 
                 values_to="NETGEN") %>%
    mutate_at(vars(ORISPL, OPERATOR_ID, YR, MTH, NETGEN), as.numeric) %>%
    mutate(NETGEN = ifelse(is.na(NETGEN), 0, NETGEN)) %>%
    select(ORISPL, NAME, OPERATOR_ID, OPERATOR, STATE, GENERATOR_ID, PM, YR, MTH, NETGEN)
}
```

```{r extract-fn4, include=FALSE}

## Define Page 1 method: Generation and Fuel Data
extract_zip.p4 <- function(zip, file) {
  
  ## Assertions
  stopifnot(
    fs::is_file(zip),
    fs::path_ext(zip)=="zip",
    fs::path_ext(file) %>% str_to_lower() %in% c("xls","xlsx")
  )
  
  cat(fs::path_file(zip), "\n", sep="")
  
  ## IMPORT DATA
  ## Unzip data
  unzip(zip, files=file, exdir=here("data/eia/f923"))
  
  ## Import data
  sheet <- readxl::excel_sheets(here("data/eia/f923", file)) %>%
    purrr::keep(str_detect(., "Page 4 Gen"))
  df <- readxl::read_excel(here("data/eia/f923", file), sheet=sheet, 
                           col_names=FALSE, guess_max=10^5)

  ## Delete unzipped data
  file %>%
    fs::path_split() %>%
    purrr::flatten() %>%
    .[[1]] %>%
    fs::here("data/eia/f923", .) %>%
    fs::file_delete()

  ## CLEAN DATA
  ## Set df type
  type <- names(df)[[1]] %>%
    stringr::str_to_upper() %>%
    stringr::str_replace_all(c("\\."="", "_"="")) %>%
    stringr::str_extract("^[A-Z]*")
  if (type=="") type <- "current4"
  df <- structure(df, class=c(type, class(df)))
  
  ## Wrangle data based on type; produce error if cleaning fn does not exist
  df <- tryCatch(
    wrangle(df),
    error = function(e) {
      stop(glue("Cleaning function does not currently run or exist for the last imported year.
                 wrangle.{type} {e}"))
    }
  )
  return(df)
}
```

``` {r import-4}

## Build zip list
l.zip4 <- fs::dir_ls(here("data/eia/f923"), regexp="f923_\\d{4}\\.zip$")

## Build zip & file dataframe
df.zip4 <- l.zip4 %>%
  map_dfr(~utils::unzip(.x, list=TRUE) %>% mutate(Zip = .x)) %>%
  rename_with(.fn=str_to_upper) %>%
  mutate(S3 = case_when(str_detect(NAME, "eia923December2008\\.xls") ~ 1,
                        str_detect(NAME, "EIA923( SCHEDULES |_Schedules_)2_3_4_5.*\\.(xls|XLS|xlsx)") ~ 1,
                        TRUE ~ 0),
         YR = str_extract(NAME, "(19|20)\\d{2}") %>% as.numeric()) %>%
  filter(S3==1)

## Modify NAME class within zip dataframe
df.zip4 <- df.zip4 %>%
  mutate(NAME = map(NAME, ~structure(.x, class=c(class(.x), "p4"))))

## Import dataset
if (length(l.zip4)!=nrow(df.zip4)) {
  stop("Number of data files does not match the number of zip archives.")
}
df.genunit <- pmap_dfr(df.zip4 %>% select(ZIP, NAME),
                       ~extract_zip(..1, ..2))
```


## Clean

At present, we perform the following to clean the dataset:

- Drop unnecessary observations, including non-contiguous states;

```{r clean-4}

## Clean fuel data
df.genunit_cln <- df.genunit %>%

  ## Drop unnecessary obs
  filter(!(STATE %in% c("AK","HI"))) %>%
  
  ## Finalize dataset
  select(ORISPL, GENERATOR_ID, STATE, PM, YR, MTH, NETGEN) %>%
  arrange(ORISPL, GENERATOR_ID, YR, MTH)
```


## Analysis

```{r analysis-genunit-state, fig.height=10, fig.width=14}

df.genunit_cln %>%
  group_by(STATE, YR, MTH) %>%
  summarise(NETGEN = sum(NETGEN, na.rm=TRUE)/10^6, .groups="drop") %>%
  mutate(YM = ymd(paste(YR, MTH, "01", sep="-"))) %>%
  ggplot() +
    geom_line(aes(x=YM, y=NETGEN), size=0.5) +
    facet_geo(~ STATE, grid="us_state_contiguous_grid1") +
    labs(title="Net generation by state",
         y="Net generation [TWh]",
         x="Month") +
    theme_eia()
```

```{r analysis-genunit-state-relative, fig.height=10, fig.width=14}

df.genunit_cln %>%
  group_by(STATE, YR, MTH) %>%
  summarise(NETGEN = sum(NETGEN, na.rm=TRUE)/10^6, .groups="drop") %>%
  mutate(YM = ymd(paste(YR, MTH, "01", sep="-"))) %>%
  ggplot() +
    geom_line(aes(x=YM, y=NETGEN), size=0.5) +
    facet_geo(~ STATE, grid="us_state_contiguous_grid1", scales="free_y") +
    labs(title="Net generation by state",
         y="Net generation [TWh]",
         x="Month") +
    theme_eia() +
    theme(axis.text.y = element_blank())
```


## Export

### Data
This section exports the cleaned net generation by unit data to a csv and/or an rds file:

```{r export}

# write_csv(df.genunit_cln, l.file$genunit)
saveRDS(df.genunit_cln, fs::path_ext_set(l.file$genunit, "rds"))
```

### Dictionary

The following table describes the columns in the exported dataset.

| **Variable** | **Unit** | **Description** |
|------------------------|-----------------|
| ORISPL || ORISPL code as a unique two- to five-digit integer. |
| GENERATOR_ID || Generator code as a unique one- to five-alphanumeric. |
| STATE || Plant state code represented by a unique two-letter abbreviation. |
| PM || Prime mover in set {CA, CS, CT, OT, ST} |
| YR || Data year from 1997 to 2019. |
| MTH || Data month as a one- or two-digit integer. |
| NETGEN | MWh | Net electricity generation in a particular year-month. |

